Contributors: Reed Dalton, Corey Haines, Alex Jansen, McKenna Karnes

Probability Practice

Part A.

Given a single survey question presented to visitors of a particular website, what percentage of people who truthfully answered the question selected yes?
* There are two types of clickers: Random Clickers (RC), and Truthful Clickers (TC)
* There are two answers to the survey: yes and no * The expected fraction of Random Clickers (RC) is 0.3 * The expected fraction of Truthful Clickers (TC) is 0.7
* The probability that the Random Clickers (RC) selects yes is .5 * The probability that the Random Clickers (RC) selects no is .5
The results of the survey were such that 65% said Yes and 35% said No.

prob.yes = 0.65
prob.no = 0.35
prob.yes.given.RC = 0.5
prob.no.given.RC = 0.5
prob.TC = 0.7
prob.RC = 0.3
prob.yes.given.TC = (prob.yes - (prob.RC * prob.yes.given.RC))/prob.TC

prob.yes = prob.TC * prob.yes.given.TC + prob.RC * prob.yes.given.RC

The percentage of Truthful Clickers who answered “Yes” to the the simple survey is 0.7142857

Part B.

Medical test for disease using Bayes Formula.
* The sensitivity is about 0.993. That is, if someone has the disease, there is a probability of 0.993 that they will test positive.
* The specificity is about 0.9999. This means that if someone doesn’t have the disease, there is probability of 0.9999 that they will test negative.
* About 0.0025% of all people in the general population have the disease.
If someone tests positive for the disease, what is the probability that they are actually infected?

pop.infected = 0.000025
pop.clean = 0.999975

infected.pos = 0.993
infected.neg = 0.007

clean.pos = 0.0001
clean.neg = 0.9999
infected.test.pos = (infected.pos*pop.infected)/(infected.pos*pop.infected + clean.pos*pop.clean)
infected.test.pos = infected.test.pos * 100

If someone tested positive for the disease, there is only a 19.8882413% chance that they actually have the disease.

Given our calculations, implementing a universal testing policy for this disease would be problematic because for over 80% of positive test, the patient will not actually be infected with the disease. This could lead to mass hysteria and put unneeded strain on the population’s health systems.

Exploratory Analysis: Green Buildings

In order to determine if investing in a green certification for a new mixed-use building in East Austin would be economically advantageous for the developer, we attempted to provide a more convincing analysis than the preliminary recommendations made by the developer’s staff member.

Formulating our own independent analysis, we first narrowed down the available dataset of commercial rental properties to a subset that mimicked the environment presented at the East Austin development site. We decided to keep buildings with a very low occupation rates in the data set, as a worse case scenario example for our property in East Austin.

Once we settled on the particular subset of data we sought to analyze, we began to compare the cost of green buildings to nongreen buildings in each cluster, hoping to identify a premium for green buildings that was reflected in the rent price. Contrary to the on-staff statistician’s recommendation, we did not feel that seperating green building and non-green buildings was a good decision because there was important information captured in the cluster category, which the statistician removed from his analysis.

After determining that a premium did in fact exist, we drafted our own recommendation regarding the profitabity of “going green” on the development. In order to make such a recommendation, we needed to provide evidence for why the revenue generated from the green rent premium exeeded the costs of constructing a green building.

gb = read.csv('greenbuildings.csv')
rent.days=plot_ly(data=gb,x=~total_dd_07,y=~Rent)

Variable & Subset Selection

To create a target subset similar to the building being constructed, we needed to narrow down our dataset to the most relevant variables.

We first considered the role of utilities (total_dd_07,Gas_Costs, Electricity_Costs, net) on price. We found that reporting a high number of heating and cooling days per year did not correspond to relatively higher rents.

As a result, taking gas and electricity costs into account no longer intuitively felt useful in evaluating the Austin apartment’s rent. In some cases, utilities were even included in the building’s rent. Additionally, we did not know how the East Austin building planned to distribute the cost of utilities.

p = plot_ly(data=gb,x=~age, y=~stories)

Row Selection:
We decided to filter out building built over 50 years ago and building taller than 30 stories. This decision allowed us to capture a subset of building cluster more inline with the attributes of the proposed building in East Austin. (At the time of the new East Austin building’s opening,the building will be 0 years old and 15 stories tall).

gb = gb[gb$age <=50,]
gb = gb[gb$stories <=30,]
q = plot_ly(data=gb,x=~age, y=~stories)

The above plot reflects our new subset of data.

#filtering out green and non green buildings
greenbuildings = gb[gb$green_rating == 1,]  
nongreen = gb[gb$green_rating==0,]
#Range of green rents and non green rents
green.rent = plot_ly(data=greenbuildings,x=~Rent)%>% layout(title = "Rent of Green Buildings",  yaxis = list(title = "Frequency"))
non.green.rent = plot_ly(data=nongreen, x=~Rent)%>% layout(title = "Rent of NonGreen Buildings",  yaxis = list(title = "Frequency"))

Both bar charts have somewhat wide ranges. Most of the rents falls in between 0 and $80/sqft. Because of this, the on-staff statistician’s method of separating and merely taking averages wasn’t prudent. A more accurate insight could be attained by analyzing clusters of green and non green buildings with similar characteristics to the building planned for East Austin.

#Set up vectors to hold the ratios
greenratios = c()
nongreenratios = c()
greenratios = greenbuildings$Rent/greenbuildings$cluster_rent
nongreenratios = nongreen$Rent/nongreen$cluster_rent

#average
average.green = mean(greenratios)
average.nongreen = mean(nongreenratios)

#averages on averages
quotientavg = average.green/average.nongreen

We determined that the premium on green rent was 6.75%. This is larger than the 5% premium on green construction in the Austin area, so over the expected lifetime of the building, a green building would generate more revenue than a standard building. Therefore, we will advise the real-estate developer to proceed with building a green development.

Bootstrapping and ETF Returns

To understand the tradeoffs of risk and return in selecting Exchange Traded Funds (ETFs), we explored the growth over several years of historical data of the following ETFs:
+US domestic equities (SPY: the S&P 500 stock index)
+US Treasury bonds (TLT)
+Investment-grade corporate bonds (LQD)
+Emerging-market equities (EEM)
+Real estate (VNQ)
We then considered three portfolios each with varying levels of risk:
+Even Split: 20% of assets in each of the five ETFs
+Safer Split: investments assets in at least three classes
+Aggresive Split: consolidating assets into fewer classes with the promise of higher returns at the cost of accepting greater risk
*We have assumed each portfolio is rebalanced for free everyday i.e. with zero transaction costs

#Import the stocks
mystocks = c("SPY", "TLT", "LQD", "EEM", "VNQ")
getSymbols(mystocks)
## Warning: LQD contains missing values. Some functions will not work if
## objects contain missing values in the middle of the series. Consider using
## na.omit(), na.approx(), na.fill(), etc to remove or replace them.
## [1] "SPY" "TLT" "LQD" "EEM" "VNQ"
#Adjust for splits and dividends
SPYa = adjustOHLC(SPY)
TLTa = adjustOHLC(TLT)
LQDa = adjustOHLC(LQD)
EEMa = adjustOHLC(EEM)
VNQa = adjustOHLC(VNQ)

EDA Visualzation

SPYa.adj <- SPYa[,6]
TLTa.adj <- TLTa[,6]
LQDa.adj <- LQDa[,6]
EEMa.adj <- EEMa[,6]
VNQa.adj <- VNQa[,6]

SPYa1 = as.numeric(SPYa.adj[1])
TLTa1 = as.numeric(TLTa.adj[1])
LQDa1 = as.numeric(LQDa.adj[1])
EEMa1 = as.numeric(EEMa.adj[1])
VNQa1 = as.numeric(VNQa.adj[1])

SPYa.ts =SPYa.adj/SPYa1
TLTa.ts =TLTa.adj/TLTa1
LQDa.ts =LQDa.adj/LQDa1
EEMa.ts =EEMa.adj/EEMa1
VNQa.ts =VNQa.adj/VNQa1

basket <- cbind(SPYa.ts,TLTa.ts,LQDa.ts,EEMa.ts,VNQa.ts  )
zoo.basket <- as.zoo(basket)
tsRainbow <- rainbow(ncol(zoo.basket))
plot(x = zoo.basket, ylab = "Cumulative Return", main = "Cumulative Returns",col = tsRainbow, screens = 1)
legend(x = "topleft", legend = c("SPY", "TLT", "LQD", "EEM", "VNQ"), lty = 1,col = tsRainbow)

*Next we plotted the Cumulative Returns of each stock from 2007-01-3 to 2017-08-09.
+This allowed us to visualize each ETF’s gains/loses over time.
+Visually, it seems that LQD has less volatility than EEM.
+Rather than looking at this graph, we would like a way to standardize our comparison of the risk and return for investing in each ETF.
+We believe value at risk (VaR) will be a good comparison measure here.

Even Split Portfolio

#############################
#########Even Split Portfolio
#############################

#Setting up an all returns matrix showing closing prices for each day 
all_returns = cbind(ClCl(SPYa),ClCl(TLTa),ClCl(LQDa), ClCl(EEMa), ClCl(VNQa))
all_returns = as.matrix(na.omit(all_returns))

n_days = 20#i.e. four weeks

#Setting up the Even Split Portfolio and using Bootstrap to find VaR
initial.wealth = 100000
sim.even = foreach(i=1:5000, .combine = 'rbind')%do%{
  days = 20
  total.wealth = initial.wealth
  even.weight = c(0.2,0.2,0.2,0.2,0.2)
  even.holdings = even.weight*total.wealth
  even.wealthtracker = rep(0,days)
  #loop over four trading weeks
  for(today in 1:days){
    return.today = resample(all_returns, 1, orig.ids = FALSE)
    even.holdings = even.holdings + even.holdings*return.today
    total.wealth = sum(even.holdings)
    even.wealthtracker[today] = total.wealth
  }
  even.wealthtracker
}

#Calculate 5% VaR for even portfolio
even.split.VaR = quantile(sim.even[,n_days], 0.05) - initial.wealth

For our Even Split portfolio, we placed $20,000 in each of the 5 ETFs.

Even Split:
*At the five percent level, The value at risk for our even split portfolio is -6204.4200045.

Safer Split

##################
###### Safer Split
##################

safe_returns = cbind(ClCl(SPYa),ClCl(TLTa),ClCl(LQDa))
safe_returns = as.matrix(na.omit(safe_returns))

#Setting up the Safe Split Portfolio and using Bootstrap to find VaR

initial.wealth = 100000

sim.safe = foreach(i=1:5000, .combine = 'rbind')%do%{
  days = 20
  total.wealth = initial.wealth
  safe.weight = c(1/3,1/3,1/3)
  safe.holdings = safe.weight*total.wealth
  safe.wealthtracker = rep(0,days)
  #loop over four trading weeks
  for(today in 1:days){
    return.today = resample(safe_returns, 1, orig.ids = FALSE)
    safe.holdings = safe.holdings + safe.holdings*return.today
    total.wealth = sum(safe.holdings)
    safe.wealthtracker[today] = total.wealth
  }
  safe.wealthtracker
}

#Calculate 5% VaR for safe portfolio
safe.split.VaR = quantile(sim.safe[,n_days], 0.05) - initial.wealth

For the Safe Split, we limited our portfolio to only 3 ETFs made up of the S&P, Treasury Bonds, and Corporate Bonds. Treasury Bonds are a safe investment because they are guaranteed by the Federal Government and thus unless the government collapses, virtually risk free. Corporate Bonds are a safe investment because they will pay out unless the company defaults. And the S&P is a good sample of how the entire stock market moves, and is diversified enough to not be too influenced by fluctuations in a single sector.

Safe Split:
*At the five percent level, The value at risk for our even split portfolio is -3093.1892357.

Aggressive Portfolio

########################
####Aggressive Portfolio
########################
aggressive_returns = cbind(ClCl(EEMa),ClCl(VNQa))
aggressive_returns = as.matrix(na.omit(aggressive_returns))

#Setting up the Agressive Split Portfolio and using Bootstrap to find VaR

initial.wealth = 100000

sim.aggressive = foreach(i=1:5000, .combine = 'rbind')%do%{
  days = 20
  total.wealth = initial.wealth
  aggressive.weight = c(1/2,1/2)
  aggressive.holdings = aggressive.weight*total.wealth
  aggressive.wealthtracker = rep(0,days)
  #loop over four trading weeks
  for(today in 1:days){
    return.today = resample(aggressive_returns, 1, orig.ids = FALSE)
    aggressive.holdings = aggressive.holdings + aggressive.holdings*return.today
    total.wealth = sum(aggressive.holdings)
    aggressive.wealthtracker[today] = total.wealth
  }
  aggressive.wealthtracker
}

#Calculate 5% VaR for aggressive portfolio
aggressive.split.VaR = quantile(sim.aggressive[,n_days], 0.05) - initial.wealth

For the Aggresive Split, we shrunk our portfolio to only 2 ETFs made up of Real Estate and Emerging Market Equities. Real Estate is a risky investment as one could see from the 2008 housing bubble and the forclosures caused by the sub-prime mortgage crisis, but can grow quickly based on speculation. Emerging Market Equities have a high chance of failing due to geopolitics and various other factors, but there are high growth opportunities.

Aggresive Split:
*At the five percent level, the value at risk for our even split portfolio is -1.353049110^{4}.

Comparison:
Even Split - -6204.4200045
Safe Split - -3093.1892357
Aggresive Split - -1.353049110^{4}
The Safe Split had a 49% decrease in the value at risk than the Even Split. The Aggressive Split had a 112% increase in the value at risk than the Even Split. Our selections for the ETFs in our Safe Split portfolio and in our Aggressive Split portfolio were valid selections. Thus based on an investor’s time horizon and specific needs, each portfolio could be a desirable option.

NutrientH2O Market Segmentation

social_marketing = read.csv('social_marketing.csv')
smm = social_marketing[ , -which(names(social_marketing) %in% c("chatter","spam", 'uncategorized'))]
sm <- smm[,-1]/rowSums(smm[,-1], na.rm = FALSE)
sm_scaled <- scale(sm, center=TRUE, scale=TRUE) 

Pre-Proccessing

To start, we found the proportion of each follower’s tweets per category over their total number of tweets.
Next, we determined that there was likely no realistic way to use chatter and uncategorized to find interesting insight into NutrientH2O’s followers other than the propensity of each follower for tweeting. Because we already decided to scaled each category by the total number of tweets, we chose to drop these catagories.
Following the previous point, it did not seem like a feasible idea to create a marketing campaign based off of spam tweets.
+No account tweets out more than 2 tweets categorized as spam, and the highest percentage of an account’s tweets being categorized as spam was 7%.
+Thus for the simplicity of the model, we chose to drop the spam category.
While it is probably not a strong branding idea for NutrientH2O to develop a marketing campaign based on adult themes, we felt that leaving the adult category in the data made sense, as it had the posibility to capture some demographic information about NutrientH2O followers.
*Although potentially redundant, our final step was to center and scale our data to ensure that our kmean centers would be interpretable.

Determining ‘K’

Our goal is to group the categories of tweets into an optimal K-number of clusters.

We first experimented with finding the number of clusters with the highest CH(k) score when running kmeans++. From the graph, the highest score came when the data was grouped into 3 clusters.

ch.index = function(x,kmax,iter.max=100,nstart=10,algorithm="Lloyd") {
  ch = numeric(length=kmax-1)
    n = nrow(x)
    for (k in 2:kmax) {
        a = kmeanspp(x,k,iter.max=iter.max,nstart=nstart,algorithm=algorithm)
        w = a$tot.withinss
        b = a$betweenss
        ch[k-1] = (b/(k-1))/(w/(n-k))
        }
    return(list(k=2:kmax,ch=ch))
}
a1 = ch.index(sm_scaled, 10)
k1.hat = a1$k[which.max(a1$ch)]
plot(a1$k,a1$ch, xlab='K', ylab='CH(K)', type='b', main='K-Means Clustering : CH Index vs K' )

Next we ran Kmeans++ with 3 clusters, as Kmeans++ solves an augmented objective function that allows it to find the most optimal clustering faster than normal Kmeans.

maxCH = kmeanspp(sm_scaled, 3,iter.max=100,nstart=10,algorithm="Lloyd")

Finding Insights

In order to make some sort of inferences about the interests of NutrientH2O’s followers for target marketing and branding, we examined the distance of each category from the cluster center. We then sorted based on distance and created barcharts for each of the clusters.

ch_cluster_1 = sort(maxCH$centers[1,])
ch_cluster_2 = sort(maxCH$centers[2,])
ch_cluster_3 = sort(maxCH$centers[3,])
par(mfrow=c(1,3))
barchart(ch_cluster_2)

barchart(ch_cluster_3)

barchart(ch_cluster_1)

Inferences:
* Something about these three clusters doesn’t feel right.
* The most noticeable attribute is that the second cluster has no seperations in the categories. There are no distinct groupings on the high or low end.
* Intuitively, the fact that religion and parenting are the strongest factors in the cluster with the lowest sum of squared errors does not fit with our expectations of the primary interests of NutrientH2O’s followers.

Elbow Method

Continuing our exploration for the optimal number of cluster for use in kmeans, we implemented the Elbow Method.

The Elbow Method requires running kmeans or kmeans++ on multiple k-number of clusters and plotting them based on their sum of squared errors (SSE).
Next, A visual inspection is performed on the line chart. +If there is a noticable change in the direction of the line, the inflection point (i.e. the Elbow) is the ideal number of clusters for Kmeans/Kmeans++.

wss <- matrix(,nrow = length(2:10), ncol = 1)

  for (i in 2:10) wss[i] <- sum(kmeanspp(sm_scaled,i,iter.max=100,nstart=10,algorithm="Lloyd")$withinss)

plot(1:10, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares",
     main="Assessing the Optimal Number of Clusters with the Elbow Method",
     pch=20, cex=2)

From the graph, there ever so slightly appears be an inflection point at 6 clusters.

maxElbow = kmeanspp(sm_scaled, 6,iter.max=100,nstart=10,algorithm="Lloyd")

Creating 6 clusters using kmeans++ seems to produce segments of categories that are more interesting and with greater seperation than the 3 ckuster classification.

elbow_cluster_1 = sort(maxElbow$centers[1,])
elbow_cluster_2 = sort(maxElbow$centers[2,])
elbow_cluster_3 = sort(maxElbow$centers[3,])
elbow_cluster_4 = sort(maxElbow$centers[4,])
elbow_cluster_5 = sort(maxElbow$centers[5,])
elbow_cluster_6 = sort(maxElbow$centers[6,])
#Clusters plotted in order of decreasing withinss
barchart(elbow_cluster_6)

Based on categorical seperation, we can interpret each cluster (in order of increasing sum of squared error of each cluster) as following:

*politics and news have the most seperation out of the strongest cluster. As Donald Trump has shown over the last 2 and a half years, Twitter is one of the most populat places on the internet for people to espouse their political views. While we don’t find out much demographic information from this first cluster, we can make the assumption that many of NutrientH2O’s followers consume media, and especially media related to news and politics.
+These users were least interested in: cooking, health nutrition, and photo sharing.

barchart(elbow_cluster_2)

*health_nutrition,personal_fitness, and outdoorshave the most seperation out of the second strongest cluster. These users seem to be in the more stereotypical category of NutrientH2O consumers. +These users were least interested in: politics,sports_fandom,and photo_sharing.

barchart(elbow_cluster_1)

*photo_sharing and shopping have the most seperation out of the third strongest cluster. Stereotypically, these activities are more feminine than masculine activites. +These users were least interested in: food,cooking, and health_nutrition

barchart(elbow_cluster_3)

*college_uni and online_gaming have the most seperation out of the fourth strongest cluster. These users seem to be your stereotypical college males. +These users were least interested in: cooking,personal_fitness, and health_nutrition

barchart(elbow_cluster_4)

*cooking,fashion, and beauty have the most seperation out of the fith strongest cluster. Another cluster of categories that can generally be thought of as more feminine than masculine activites. +These users were least interested in: politics, food,and sports_fandom

barchart(elbow_cluster_5)

*religion,parenting, andsports_fandomhave the most seperation out of the weakest cluster. This chart is almost identical to the strongest clustering of categories when the number of clusters was 3. The fact that the two clusters are so similar corroborates our decision to explore other numbers of cluster than 3. +These users were least interested in: health_nutrition,politics, and photo_sharing

Conclusion

From our observations of the most right of center categories in each clusters, we can make some generalizations about NutrientH2O’s followers. One shocking observation is that sports_playing doesn’t appear in the top categories of any of the clusters. This is generally the first activity associated with sports drinks. Next, it across most clusters, the demographic of NutrientH2O’s followers is a younger demographic. the weakest cluster is the only cluster that seems to suggest users that are GenXers or parents. *There are more clusters that are dominated by stereotypically “feminine,” activities. There is one category that coresponds to inactive, videogame-playing college aged males, and one for middle aged people.